A Tool for Text Comparison
Abstract
Text reuse is commonplace in academia and the media. An efficient algorithm for automatically detecting and measuring similar or related texts would have applications in corpus linguistics, historical studies and natural language engineering. To explore the issue of text reuse, a tool named Crouch has been developed, based on the TESAS system (Piao 2001), for comparing and measuring text similarity and derivation in sets of texts. Given a set of candidate source and derived texts, the tool maps related sentences between a pair of texts using n-gram, stemming and synonym matching approaches. Crouch examines the textual similarity of individual pairs of texts, and also clusters pairs of texts in a collection according to their similarity. The comparison is directional: comparing a derived text to its source generally produces a higher score than comparing in the opposite direction. This presents the possibility of detecting the direction of text derivation. The tool displays its comparison of a given pair of texts in a graphical interface to help users analyse the texts. Furthermore, as the tool is written in Java and fully supports Unicode, it can be applied to many languages. At Lancaster University, it is currently being used to help detect related English newspaper articles in 17th-century newspapers.
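The directional scoring described in the abstract can be illustrated with a minimal sketch. This is not Crouch's or TESAS's actual algorithm, which the abstract does not specify. It assumes a simple word-trigram containment measure and leaves out stemming and synonym matching. The function names `word_ngrams` and `containment` and the two sample texts are hypothetical, chosen only for illustration.

```python
import re


def word_ngrams(text, n=3):
    """Return the set of word n-grams in text (lowercased, punctuation dropped)."""
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def containment(text_a, text_b, n=3):
    """Fraction of text_a's n-grams that also occur in text_b.

    The measure is asymmetric: containment(a, b) and containment(b, a)
    generally differ, which is what makes the comparison directional.
    """
    grams_a = word_ngrams(text_a, n)
    if not grams_a:
        return 0.0
    grams_b = word_ngrams(text_b, n)
    return len(grams_a & grams_b) / len(grams_a)


# Hypothetical texts: the derived text is an excerpt of the source.
source = ("The king arrived in London on Tuesday and was met by a great crowd "
          "of citizens who had waited since the early morning")
derived = "The king arrived in London on Tuesday and was met by a great crowd"

derived_to_source = containment(derived, source)
source_to_derived = containment(source, derived)
print(derived_to_source, source_to_derived)
```

In this sketch the derived-to-source score is higher because every trigram of the excerpt appears in the source, while the source contains material the excerpt lacks. A derived text that adds a lot of new material would narrow or reverse the gap, so the asymmetry is a hint about direction rather than proof.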
Similar resources
High capacity steganography tool for Arabic text using 'Kashida'
Steganography is the ability to hide secret information in a cover-media such as sound, pictures and text. A new approach is proposed to hide a secret into Arabic text cover media using "Kashida", an Arabic extension character. The proposed approach is an attempt to maximize the use of "Kashida" to hide more information in Arabic text cover-media. To approach this, some algorithms have been des...
Plagiarism checker for Persian (PCP) texts using hash-based tree representative fingerprinting
With due respect to authors' rights, plagiarism detection is one of the critical problems in the field of text mining, and one that interests many researchers. The issue is considered a serious one in higher academic institutions. There exist language-free tools which do not yield any reliable results since the special features of every language are ignored in them. Considering the paucit...
Systemic Functional Linguistics as a Tool of Text Analysis for Translation
Translation, ipso facto, is an understanding and a transferal of meaning from one language into another. Therefore, it may be fitting to conclude that a suitable semantic theory should underpin any attempt to that end. This paper advocates implementing Systemic Functional Linguistics (henceforth SFL) which subscribes to a view of language as a "meaning-potential". In fact, Halliday and Matthies...
Intelligent Text Comparison in Software Validation
The intelligent comparison tool (ICT) was developed to perform intelligent comparisons of text files. The purpose of the tool is to reduce the differences found when comparing two text files to only the meaningful or important ones. This tool contrasts with conventional differencing tools that find all literal differences. This tool was developed to help in the validation of software revisions ...
Corpus based coreference resolution for Farsi text
"Coreference resolution", or "finding all expressions that refer to the same entity" in a text, is one of the important requirements in natural language processing. Two words are coreferent when both refer to a single entity in the text or the real world. So the main task of coreference resolution systems is to identify terms that refer to a unique entity. A coreference resolution tool could be...
Publication date: 2003